Variable Encoding

Types of Variables

A. Categorical Encoding

  1. Nominal Variable -
    - The categories have no inherent order, so their arrangement does not matter
    Eg - Gender, City

    • One Hot Encoding
    • One Hot Encoding with many Categories
    • Mean Encoding
  2. Ordinal Variable -

    - The order of the categories matters; they can be arranged according to their rank
    Eg - Education (ranked by salary) => (Phd-1, MS-2, B.Com-3)

    • Label Encoding
    • Target Guided Ordinal Encoding


Nominal Encoding

  1. One Hot Encoding-

    Germany  France  Spain
    0        1       0
    1        0       0
    0        0       1

    In the above example we have assigned one dummy variable to each country name.
    Afterwards, one of the new variables can be dropped, because its value is implied by the others.

    Disadvantage - If there are many distinct categories, applying One Hot Encoding
    increases the dimensionality of the data and hence the time to compute
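
    The country table above can be sketched with pandas `get_dummies`. This is a minimal illustration, not the notebook's own code: the `Country` column and its three rows are made-up toy data.

```python
import pandas as pd

# Hypothetical toy data mirroring the country table above.
df = pd.DataFrame({"Country": ["France", "Germany", "Spain"]})

# One dummy column per country name.
one_hot = pd.get_dummies(df, columns=["Country"], dtype=int)

# drop_first=True drops the first dummy column (here Country_France);
# a row of all zeros then stands for France.
one_hot_dropped = pd.get_dummies(df, columns=["Country"], drop_first=True, dtype=int)

print(one_hot)
print(one_hot_dropped)
```

    With k distinct categories this produces k columns (or k-1 with `drop_first=True`), which is where the dimensionality problem comes from.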

    # Note: OrdinalEncoder maps each column's categories to integers in the listed order (ordinal encoding, not one-hot)
    from sklearn.preprocessing import OrdinalEncoder
    ode = OrdinalEncoder(categories=[['Child','Teen','Adult','Old'],[1,2,3],[1,3,4,5,6,7],['Low','Medium','High'],['Alone','Small','Medium','Large']])

    df_train[['Age','Pclass','Ticket','Fare','Family']] = ode.fit_transform(df_train[['Age','Pclass','Ticket','Fare','Family']])
    
  2. One Hot Encoding with Many Categories-
    We can handle many categories by applying One Hot Encoding only to the 10 most frequent categories and grouping all the remaining categories into a single category
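
    One way to sketch this is shown below. The helper `one_hot_top_n`, the `Other` label and the `City` data are all hypothetical names chosen for illustration; `n=2` is used only to keep the toy output small.

```python
import pandas as pd

def one_hot_top_n(series: pd.Series, n: int = 10, other_label: str = "Other") -> pd.DataFrame:
    """One-hot encode the n most frequent categories; lump the rest into other_label."""
    top = series.value_counts().nlargest(n).index
    # Keep the frequent categories, replace everything else with the shared label.
    reduced = series.where(series.isin(top), other_label)
    return pd.get_dummies(reduced, prefix=series.name, dtype=int)

# Hypothetical toy column: A and B are frequent, C and D are rare.
city = pd.Series(["A", "A", "A", "B", "B", "C", "D"], name="City")
encoded = one_hot_top_n(city, n=2)
print(encoded)
```

    The number of new columns is capped at n + 1 regardless of how many distinct categories the variable has.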

  3. Mean Encoding-
    We use this kind of encoding when a variable has many categories. For example, with pincodes of many cities, we replace each pincode with the mean of the target for that pincode: for a 0/1 target, the number of times the output is 1 for that pincode divided by the total number of times that pincode occurs.
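
    A minimal sketch of mean encoding with a pandas `groupby`. The `Pincode` and `Target` columns and their values are made-up toy data.

```python
import pandas as pd

# Hypothetical toy data: a pincode column and a binary target.
df = pd.DataFrame({
    "Pincode": ["110001", "110001", "110001", "400001", "400001", "560001"],
    "Target":  [1, 0, 1, 0, 0, 1],
})

# Mean of the target per pincode (for a 0/1 target, the share of rows with target 1).
pincode_mean = df.groupby("Pincode")["Target"].mean()

# Replace each pincode with its mean.
df["Pincode_mean_enc"] = df["Pincode"].map(pincode_mean)
print(df)
```

    In practice the means would usually be computed on the training data only and then mapped onto the test data, so that test targets do not leak into the feature.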


Ordinal Encoding

  1. Label Encoding-

    Education
    BE 1
    MAS 2
    Phd 3

    Here we have assigned a value to each category according to its rank
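
    The table above can be reproduced with an explicit mapping. This is a sketch on a made-up `Education` column; the dictionary `education_rank` is a name introduced here for illustration.

```python
import pandas as pd

# Hypothetical toy column using the categories from the table above.
df = pd.DataFrame({"Education": ["Phd", "BE", "MAS", "BE"]})

# Explicit rank for each category, matching the table.
education_rank = {"BE": 1, "MAS": 2, "Phd": 3}
df["Education_encoded"] = df["Education"].map(education_rank)
print(df)
```

    An explicit mapping is used here because scikit-learn's `LabelEncoder` numbers the classes in sorted order, which does not necessarily match the intended rank.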

    # Note: pd.get_dummies one-hot encodes the listed columns (it does not label encode); drop_first=True drops one dummy per column
    import pandas as pd
    df_test = pd.get_dummies(df_test, columns=['Sex','Embarked','Title'], drop_first=True)
    
  2. Target Guided Ordinal Encoding-
    In this we calculate the mean of the target for each category (for a 0/1 output, how often the output is 1 for that category), then rank the categories in order of their means and use the rank as the encoded value
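
    A minimal sketch of these two steps (per-category mean, then rank). The `Cabin` and `Survived` columns are made-up toy data, and ranking from the lowest mean upwards starting at 0 is an assumption; the opposite direction works equally well as long as it is applied consistently.

```python
import pandas as pd

# Hypothetical toy data: a categorical column and a binary target.
df = pd.DataFrame({
    "Cabin":    ["A", "A", "B", "B", "B", "C", "C"],
    "Survived": [1, 1, 0, 1, 0, 0, 0],
})

# Step 1: mean of the target per category.
category_mean = df.groupby("Cabin")["Survived"].mean()

# Step 2: order the categories by their mean and number them 0, 1, 2, ...
ordered = category_mean.sort_values().index
rank = {category: i for i, category in enumerate(ordered)}

df["Cabin_encoded"] = df["Cabin"].map(rank)
print(rank)
```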



Column Transformer

ddc1cc790ad81d1b72fea38d4238d406.png




B. Numerical Encoding

  1. Discretization (Binning)
  2. Binarization

  1. Binning

9389198c3520a874147bde9f21122af1.png

1.1 Unsupervised Binning
    1.1.1 Equal Width Binning (Uniform Binning)
    1.1.2 Equal Frequency Binning (Quantile Binning)
    1.1.3 K Means Binning
1.2 Supervised Binning
    1.2.1 Decision Tree Binning
1.3 Custom Binning 

1.1.1 Equal Width Binning

1bd6efaaf2af6915bc026f882bdd86f9.png

1.1.2 Equal Frequency Binning

2061fd9daad3a01da206972cd57b38c5.png

1.1.3 K Means Binning

78be7d634c10f1365f2329f955c40df3.png


b7cbba0857a866ac2caafebc4088589f.png

from sklearn.preprocessing import KBinsDiscretizer
kbin_age = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='quantile')
kbin_fare = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='quantile')
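
The three unsupervised strategies listed earlier correspond to the `strategy` values `'uniform'` (equal width), `'quantile'` (equal frequency) and `'kmeans'` of `KBinsDiscretizer`. The sketch below compares them on a made-up, skewed `ages` array; `n_bins=2` is chosen only to keep the result easy to read.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical toy feature with one large outlier; shape (n_samples, 1).
ages = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float).reshape(-1, 1)

# Equal width: bins span equal ranges of the values.
uniform = KBinsDiscretizer(n_bins=2, encode="ordinal", strategy="uniform")
# Equal frequency: bins hold roughly the same number of samples.
quantile = KBinsDiscretizer(n_bins=2, encode="ordinal", strategy="quantile")
# K-means: bin edges come from 1-D k-means cluster centres.
kmeans = KBinsDiscretizer(n_bins=2, encode="ordinal", strategy="kmeans")

uniform_bins = uniform.fit_transform(ages).ravel()
quantile_bins = quantile.fit_transform(ages).ravel()
kmeans_bins = kmeans.fit_transform(ages).ravel()

print("uniform :", uniform_bins)
print("quantile:", quantile_bins)
print("kmeans  :", kmeans_bins)
print("uniform edges:", uniform.bin_edges_[0])
```

On this data the equal-width bins put the outlier alone in the second bin, while the equal-frequency bins split the samples five and five.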